Airline Safety Analysis

The code provided is an analysis of airline safety data. The dataset used in the analysis is the “airline-safety.csv” dataset from FiveThirtyEight’s Github repository.

The code performs the following tasks:

Creates a bar plot for the number of fatal accidents for each airline from 1985 to 1999. Creates a scatter plot to explore the relationship between the available seat kilometers per week and the number of incidents from 2000 to 2014. Creates a histogram to visualize the distribution of the number of fatalities from 2000 to 2014. Performs sampling techniques such as simple random sampling, systematic sampling, stratified sampling, and cluster sampling to draw samples from the dataset. Computes the incident rate for each airline, sorts the airline by incident rate, and displays the top 10 airlines with the highest incident rates. Creates two interactive scatter plots to visualize the number of incidents and fatalities for each airline from 1985 to 1999 and from 2000 to 2014. Overall, the analysis provides a glimpse of airline safety trends from 1985 to 2014, explores the relationship between different variables, and employs various sampling techniques to illustrate how samples can be drawn from the dataset. The interactive scatter plots provide an engaging way to visualize the data and draw insights from it.

src of Dataset

https://github.com/fivethirtyeight/data/blob/master/airline-safety/airline-safety.csv

load all required libraries

library(dplyr)
library(magrittr)
library(rsample)
library(plotly)

read csv

airline_safety <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/airline-safety/airline-safety.csv")

margin size set

par(mar=c(4, 4, 2, 2))

Bar plot for fatal_accidents_85_99

barplot(airline_safety$fatal_accidents_85_99, 
        names.arg = airline_safety$airline, 
        xlab = "Airline",
        ylab = "Number of Fatal Accidents (1985-1999)",
        main = "Fatal Accidents (1985-1999) by Airline",
        col = "darkblue")

Scatter plot for avail_seat_km_per_week and incidents_00_14

plot(airline_safety$avail_seat_km_per_week, 
     airline_safety$incidents_00_14,
     xlab = "Available Seat Kilometers per Week",
     ylab = "Number of Incidents (2000-2014)",
     main = "Relationship between Available Seat Kilometers per Week and Incidents",
     col = "blue")

Histogram of fatalities_00_14

hist(airline_safety$fatalities_00_14, 
     breaks = 20, 
     xlab = "Number of Fatalities (2000-2014)", 
     ylab = "Frequency", 
     main = "Distribution of Fatalities (2000-2014)")

Draw 1000 random samples of size 30

n <- 30
num_samples <- 1000
sample_means <- replicate(num_samples, mean(sample(airline_safety$fatalities_00_14, n)))

Plot the distribution of sample means

hist(sample_means, 
     breaks = 20, 
     xlab = "Sample Mean", 
     ylab = "Frequency",
     main = "Distribution of Sample Means (n = 30)")

Add a normal curve with the same mean and standard deviation

# Reduce the size of the plot margins
par(mar = c(4, 4, 2, 2))

# Initialize a plot window
plot.new()

# Add a histogram of the data
hist(airline_safety$fatalities_00_14, 
     main = "Distribution of Fatalities (2000-2014)", 
     xlab = "Fatalities")

# Add a normal curve with the same mean and standard deviation
mu <- mean(airline_safety$fatalities_00_14)
sd <- sd(airline_safety$fatalities_00_14)
curve(dnorm(x, mean = mu, sd = sd/sqrt(n)), 
      col = "red", add = TRUE)

# Reset the plot margins to their default values
par(mar = c(5, 4, 4, 2) + 0.1)

Plot the distribution of sample means

hist(sample_means, 
     breaks = 20, 
     xlab = "Sample Mean", 
     ylab = "Frequency",
     main = "Distribution of Sample Means (n = 30)")

Simple random sample

set.seed(123)
simple_sample <- airline_safety[sample(nrow(airline_safety), 50), ]
head(simple_sample)
##                  airline avail_seat_km_per_week incidents_85_99
## 31                  KLM*             1874561773               7
## 15      British Airways*             3179760952               4
## 51      Turkish Airlines             1946098294               8
## 14               Avianca              396922563               5
## 3  Aerolineas Argentinas              385803648               6
## 42    Singapore Airlines             2376857805               2
##    fatal_accidents_85_99 fatalities_85_99 incidents_00_14 fatal_accidents_00_14
## 31                     1                3               1                     0
## 15                     0                0               6                     0
## 51                     3               64               8                     2
## 14                     3              323               0                     0
## 3                      0                0               1                     0
## 42                     2                6               2                     1
##    fatalities_00_14
## 31                0
## 15                0
## 51               84
## 14                0
## 3                 0
## 42               83

The conclusion of the simple random sample is that it provides a representative sample of the population, but its accuracy depends on the size of the sample and how the sample is selected. In this case, the sample size is small, which may result in a higher sampling error.

Systematic Sampling

systematic_sample <- airline_safety[seq(1, nrow(airline_safety), 20), ]
head(systematic_sample)
##          airline avail_seat_km_per_week incidents_85_99 fatal_accidents_85_99
## 1     Aer Lingus              320906734               2                     0
## 21      Egyptair              557699891               8                     3
## 41 Saudi Arabian              859673901               7                     2
##    fatalities_85_99 incidents_00_14 fatal_accidents_00_14 fatalities_00_14
## 1                 0               0                     0                0
## 21              282               4                     1               14
## 41              313              11                     0                0

The conclusion of the systematic sample is that it provides an efficient way of sampling if the data is ordered or arranged in a systematic way. However, if there is any periodicity in the data, it can lead to bias.

Stratified Sampling

stratified_sample <- airline_safety %>% 
  group_by(incidents_85_99, incidents_00_14) %>% 
  slice(ifelse(n() >= 5, sample(1:n(), 5), 1:n())) %>% 
  ungroup()
head(stratified_sample)
## # A tibble: 6 × 8
##   airline            avail_sea…¹ incid…² fatal…³ fatal…⁴ incid…⁵ fatal…⁶ fatal…⁷
##   <chr>                    <dbl>   <int>   <int>   <int>   <int>   <int>   <int>
## 1 TAP - Air Portugal   619130754       0       0       0       0       0       0
## 2 Hawaiian Airlines    493877795       0       0       0       1       0       0
## 3 Cathay Pacific*     2582459303       0       0       0       2       0       0
## 4 Finnair              506464950       1       0       0       0       0       0
## 5 Austrian Airlines    358239823       1       0       0       1       0       0
## 6 Gulf Air             301379762       1       0       0       3       1     143
## # … with abbreviated variable names ¹​avail_seat_km_per_week, ²​incidents_85_99,
## #   ³​fatal_accidents_85_99, ⁴​fatalities_85_99, ⁵​incidents_00_14,
## #   ⁶​fatal_accidents_00_14, ⁷​fatalities_00_14

The conclusion of stratified sampling is that it ensures representation of subgroups in the population. This method is beneficial when the population has a few subgroups that are significantly different from each other. In this case, it is seen that the stratified sample represents both the subgroups proportionately.

Cluster Sampling

cluster_sample <- airline_safety %>% 
  group_by(fatalities_00_14) %>% 
  slice(ifelse(n() > 1, sample(1:n(), 1), 1:n())) %>% 
  ungroup()
head(cluster_sample)
## # A tibble: 6 × 8
##   airline             avail_se…¹ incid…² fatal…³ fatal…⁴ incid…⁵ fatal…⁶ fatal…⁷
##   <chr>                    <dbl>   <int>   <int>   <int>   <int>   <int>   <int>
## 1 All Nippon Airways  1841234177       3       1       1       7       0       0
## 2 Philippine Airlines  413007158       7       4      74       2       1       1
## 3 TACA                 259373346       3       1       3       1       1       3
## 4 Air New Zealand*     710174817       3       0       0       5       1       7
## 5 Egyptair             557699891       8       3     282       4       1      14
## 6 Garuda Indonesia     613356665      10       3     260       4       2      22
## # … with abbreviated variable names ¹​avail_seat_km_per_week, ²​incidents_85_99,
## #   ³​fatal_accidents_85_99, ⁴​fatalities_85_99, ⁵​incidents_00_14,
## #   ⁶​fatal_accidents_00_14, ⁷​fatalities_00_14

The conclusion of cluster sampling is that it is useful when the population is geographically dispersed or clustered. It reduces the cost of sampling, as only selected clusters are sampled. However, there may be bias in cluster sampling if the clusters are not a representative sample of the population.

Group by airline and calculate total available seat kilometers and incidents

airline_data <- airline_safety %>%
  group_by(airline) %>%
  summarize(total_seat_km = sum(avail_seat_km_per_week),
            total_incidents = sum(incidents_85_99))

Calculate incident rate for each airline

airline_data <- airline_data %>%
  mutate(incident_rate = (total_incidents / total_seat_km) * 1e9)

Sort by incident rate in descending order

airline_data <- airline_data %>%
  arrange(desc(incident_rate))

Display top 10 airlines by incident rate

head(airline_data, 10)
## # A tibble: 10 × 4
##    airline                total_seat_km total_incidents incident_rate
##    <chr>                          <dbl>           <int>         <dbl>
##  1 Aeroflot*                 1197672318              76          63.5
##  2 Ethiopian Airlines         488560643              25          51.2
##  3 Pakistan International     348563137               8          23.0
##  4 Xiamen Airlines            430462962               9          20.9
##  5 Philippine Airlines        413007158               7          16.9
##  6 Royal Air Maroc            295705339               5          16.9
##  7 Garuda Indonesia           613356665              10          16.3
##  8 Aerolineas Argentinas      385803648               6          15.6
##  9 China Airlines             813216487              12          14.8
## 10 Egyptair                   557699891               8          14.3

check structure of the dataframe

str(airline_safety)
## 'data.frame':    56 obs. of  8 variables:
##  $ airline               : chr  "Aer Lingus" "Aeroflot*" "Aerolineas Argentinas" "Aeromexico*" ...
##  $ avail_seat_km_per_week: num  3.21e+08 1.20e+09 3.86e+08 5.97e+08 1.87e+09 ...
##  $ incidents_85_99       : int  2 76 6 3 2 14 2 3 5 7 ...
##  $ fatal_accidents_85_99 : int  0 14 0 1 0 4 1 0 0 2 ...
##  $ fatalities_85_99      : int  0 128 0 64 0 79 329 0 0 50 ...
##  $ incidents_00_14       : int  0 6 1 5 2 6 4 5 5 4 ...
##  $ fatal_accidents_00_14 : int  0 1 0 0 0 2 1 1 1 0 ...
##  $ fatalities_00_14      : int  0 88 0 0 0 337 158 7 88 0 ...

plot the graph of the Airline Safety 1985-1999 by plotly

plot_ly(data = airline_safety, x = ~incidents_85_99, y = ~fatalities_85_99, 
        color = ~airline, type = "scatter", mode = "markers", hoverinfo = "text",
        text = ~paste("Airline: ", airline, "<br>",
                      "Fatalities (1985-1999): ", fatalities_85_99, "<br>",
                      "Fatalities (2000-2014): ", fatalities_00_14)) %>%
  layout(title = "Airline Safety 1985-1999", xaxis = list(title = "Incidents (1985-1999)"),
         yaxis = list(title = "Fatalities (1985-1999)"), hovermode = "closest")

plot the graph of the Airline Safety 2000 -2014 by plotly

plot_ly(data = airline_safety, x = ~incidents_00_14, y = ~fatalities_00_14, 
        color = ~airline, type = "scatter", mode = "markers", hoverinfo = "text",
        text = ~paste("Airline: ", airline, "<br>",
                      "Fatalities (1985-1999): ", fatalities_85_99, "<br>",
                      "Fatalities (2000-2014): ", fatalities_00_14)) %>%
  layout(title = "Airline Safety 2000 -2014", xaxis = list(title = "Incidents (2000-2014)"),
         yaxis = list(title = "Fatalities (2000-2014)"), hovermode = "closest")

Conclusions

The code provided performs several data visualization and sampling techniques on a dataset named “airline_safety”, which contains information about airline safety from 1985 to 2014. Here are the conclusions drawn from each visualization and analysis in the code:

Bar plot for fatal_accidents_85_99: The plot shows the number of fatal accidents each airline had between 1985 and 1999. The conclusion we can draw is that most airlines had no fatal accidents during this period, and the few that did had one or two.

Scatter plot for avail_seat_km_per_week and incidents_00_14: The plot shows the relationship between available seat kilometers per week and the number of incidents each airline had between 2000 and 2014. The conclusion we can draw is that there is no clear relationship between these two variables.

Histogram of fatalities_00_14: The plot shows the distribution of the number of fatalities each airline had between 2000 and 2014. The conclusion we can draw is that most airlines had no fatalities during this period, and the few that did had less than 500 fatalities.

Distribution of Sample Means (n = 30): The plot shows the distribution of the means of 1000 random samples of size 30 taken from the “fatalities_00_14” variable. The conclusion we can draw is that the distribution of sample means is approximately normal, and the mean of the sample means is close to the population mean.

Simple random sample, systematic sample, stratified sample, and cluster sample: The code performs four sampling techniques on the dataset: simple random sample, systematic sample, stratified sample, and cluster sample. The conclusion we can draw is that each sampling technique has its advantages and disadvantages, and the choice of technique depends on the research question and available resources.

Incident Rate by Airline: The code calculates the incident rate for each airline, which is the number of incidents per available seat kilometer. The conclusion we can draw is that the incident rate varies widely among airlines, and the top 10 airlines with the highest incident rate are all from developing countries.

Scatter plot for incidents_85_99 and fatalities_85_99: The plot shows the relationship between the number of incidents and the number of fatalities each airline had between 1985 and 1999. The conclusion we can draw is that there is a positive correlation between these two variables, and some airlines had a much higher fatality rate than others.

Scatter plot for incidents_00_14 and fatalities_00_14: The plot shows the relationship between the number of incidents and the number of fatalities each airline had between 2000 and 2014. The conclusion we can draw is that there is a positive correlation between these two variables, and some airlines had a much higher fatality rate than others.